NSF PAR Search | NSF Public Access Repository


Revisiting the Folklore Algorithm for Random Access to Grammar-Compressed Strings

Cleary, Alan M; Winjum, Joseph; Dood, Jordan; Inenaga, Shunsuke (September 2024, Springer Nature Switzerland)
Lipták, Zsuzsanna; Moura, Edleno; Figueroa, Karina; Baeza-Yates, Ricardo (Ed.)
Grammar-based compression is a widely accepted model of string compression that allows for efficient and direct manipulation of the compressed data. Most, if not all, such manipulations rely on the random access query as a primitive: quickly returning the character at a specified position of the original uncompressed string without explicit decompression. While there are advanced data structures for random access to grammar-compressed strings that guarantee theoretical query time and space bounds, little has been done from the practical perspective of this important problem. In this paper, we revisit a well-known folklore random access algorithm for grammars in Chomsky normal form, modify it to work directly on general grammars, and show that this modified version is fast and memory efficient in practice.
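As a rough illustration of the folklore idea for grammars in Chomsky normal form (not the modified general-grammar version the paper proposes), the sketch below stores the expansion length of every nonterminal and then walks down from the start symbol, going left or right depending on where the queried position falls. The rule encoding, the names `expansion_lengths` and `access`, and the example grammar `RULES` are assumptions made for this sketch; they do not come from the paper.

```python
from typing import Dict, Tuple, Union

# Assumed encoding: a rule is either a single terminal character (str)
# or a pair of nonterminal names (tuple), as in Chomsky normal form.
Rule = Union[str, Tuple[str, str]]


def expansion_lengths(rules: Dict[str, Rule], start: str) -> Dict[str, int]:
    """Length of the string derived by each nonterminal reachable from start.

    Uses an explicit stack so deep grammars do not hit the recursion limit.
    Assumes the grammar is acyclic (it derives exactly one string).
    """
    lengths: Dict[str, int] = {}
    stack = [start]
    while stack:
        sym = stack[-1]
        if sym in lengths:
            stack.pop()
            continue
        rhs = rules[sym]
        if isinstance(rhs, str):
            lengths[sym] = 1
            stack.pop()
            continue
        left, right = rhs
        pending = [c for c in (left, right) if c not in lengths]
        if pending:
            stack.extend(pending)
        else:
            lengths[sym] = lengths[left] + lengths[right]
            stack.pop()
    return lengths


def access(rules: Dict[str, Rule], lengths: Dict[str, int], start: str, i: int) -> str:
    """Return the character at 0-based position i without decompressing."""
    if not 0 <= i < lengths[start]:
        raise IndexError("position out of range")
    sym = start
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lengths[left]:
            sym = left            # position lies in the left child's expansion
        else:
            i -= lengths[left]    # skip the left child's expansion
            sym = right


# Hypothetical example grammar deriving "ababa".
RULES: Dict[str, Rule] = {
    "S": ("A", "B"),
    "A": ("X", "Y"),
    "B": ("A", "X"),
    "X": "a",
    "Y": "b",
}
LENGTHS = expansion_lengths(RULES, "S")
```

Each query follows one root-to-leaf path, so its cost is proportional to the height of the grammar's derivation tree, and the only extra memory is one length per nonterminal.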
Full Text Available
Novel Grammar-Based Compression Algorithms for Pangenome Analysis

Dood, Jordan; Cleary, Alan M. (June 2023, Sequencing, Finishing and Analysis in the Future)

Recent advancements in DNA sequencing and assembly have drastically lowered cost and improved quality. This has allowed for collections of genomes to be created that better reflect the variability within a single species. These pangenomes continue to grow in size and scope as new sequences are added, yet such collections have already proven to be challenging to handle without significant computational infrastructure, with the primary challenge being the large data size. Unfortunately, existing compression algorithms do not allow analysis to be performed directly on the compressed data. Furthermore, many common compression paradigms do not take advantage of the high similarity between genomes from the same species, resulting in compression that scales relative to data size rather than relative to information content. In this work, we present novel grammar-based compression algorithms designed specifically for pangenome analysis. By leveraging maximal repeats, these algorithms have the potential to enable pangenome analysis at unprecedented scales.
Full Text Available
Constructing the CDAWG CFG using LCP-Intervals

https://doi.org/10.1109/DCC55655.2023.00026

Cleary, Alan M.; Dood, Jordan (March 2023, 2023 Data Compression Conference)

It is known that a context-free grammar (CFG) that produces a single string can be derived from the compact directed acyclic word graph (CDAWG) for the same string. In this work, we show that the CFG derived from a CDAWG is deeply connected to the maximal repeat content of the string it produces and thus has O(m) rules, where m is the number of maximal repeats in the string. We then provide a generic algorithm based on this insight for constructing the CFG from the LCP-intervals of a string in O(n) time, where n is the length of the string. This includes a novel data structure to support stabbing queries on LCP-intervals in O(1+k) time after O(n) preprocessing time, where k is the number of intervals stabbed. These results connect the CFG to properties of the string it produces and relate it to other string data structures, allowing it to be studied independently of the CDAWG and providing opportunity for innovation of grammar-based compression algorithms.
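To make the notion of LCP-intervals and stabbing queries concrete, here is a minimal sketch that is not the paper's construction. It builds a suffix array by plain sorting (far slower than the O(n) bound quoted above), computes the LCP array with the standard Kasai-style scan, enumerates LCP-intervals with a stack, and answers a stabbing query by a naive linear scan rather than in O(1+k) time. All function names are assumptions chosen for this sketch.

```python
from typing import List, Tuple


def suffix_array(s: str) -> List[int]:
    """Naive suffix array by sorting suffixes; for illustration only."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s: str, sa: List[int]) -> List[int]:
    """lcp[k] = longest common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    n = len(s)
    rank = [0] * n
    for k, p in enumerate(sa):
        rank[p] = k
    lcp = [0] * n
    h = 0
    for p in range(n):
        if rank[p] == 0:
            h = 0
            continue
        q = sa[rank[p] - 1]
        while p + h < n and q + h < n and s[p + h] == s[q + h]:
            h += 1
        lcp[rank[p]] = h
        if h > 0:
            h -= 1
    return lcp


def lcp_intervals(lcp: List[int]) -> List[Tuple[int, int, int]]:
    """Enumerate LCP-intervals as (lcp_value, left, right) over suffix-array ranks.

    The interval with lcp_value 0 spanning the whole array is included.
    """
    n = len(lcp)
    out: List[Tuple[int, int, int]] = []
    stack: List[Tuple[int, int]] = [(0, 0)]  # (lcp_value, left boundary)
    for k in range(1, n + 1):
        cur = lcp[k] if k < n else -1        # sentinel flushes the stack
        lb = k - 1
        while stack and cur < stack[-1][0]:
            value, lb = stack.pop()
            out.append((value, lb, k - 1))
        if stack and cur > stack[-1][0]:
            stack.append((cur, lb))
    return out


def stab(intervals: List[Tuple[int, int, int]], pos: int) -> List[Tuple[int, int, int]]:
    """Naive stabbing query: every interval whose rank range contains pos."""
    return [iv for iv in intervals if iv[1] <= pos <= iv[2]]
```

For the string `banana`, the intervals with a positive LCP value correspond to the repeated substrings `a`, `ana`, and `na`.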
Full Text Available
Exploring Frequented Regions in Pan-Genomic Graphs

https://doi.org/10.1109/TCBB.2018.2864564

Cleary, Alan; Ramaraj, Thiruvarangan; Kahanda, Indika; Mudge, Joann; Mumey, Brendan (August 2019, IEEE/ACM Transactions on Computational Biology and Bioinformatics)

Full Text Available
Pangenome-Wide Association Studies with Frequented Regions

https://doi.org/10.1145/3307339.3343478

Manuweera, Buwani; Mudge, Joann; Kahanda, Indika; Mumey, Brendan; Ramaraj, Thiruvarangan; Cleary, Alan (September 2019, Pangenome-Wide Association Studies with Frequented Regions)

Full Text Available
The future of legume genetic data resources: Challenges, opportunities, and priorities

https://doi.org/10.1002/leg3.16

Bauchet, Guillaume J.; Bett, Kirstin E.; Cameron, Connor T.; Campbell, Jacqueline D.; Cannon, Ethalinda K. S.; Cannon, Steven B.; Carlson, Joseph W.; Chan, Agnes; Cleary, Alan; Close, Timothy J.; et al (November 2019, Legume Science)

Legumes, comprising one of the largest, most diverse, and most economically important plant families, are the subject of vibrant research and development worldwide. Continued improvement of legume crops will benefit from the recent proliferation of genetic (including genomic) resources; but the diversity, scale, and complexity of these resources present challenges to those managing and using them. A workshop held in March of 2019 addressed questions of data resources and priorities for the legumes. The workshop identified various needs and recommendations: (a) Develop strategies to effectively store, integrate, and relate genetic resources collected in different projects. (b) Leverage information collected across many legume species by standardizing data formats and ontologies, improving the state of metadata about datasets, and increasing use of the FAIR data principles. (c) Advocate for the critical role that curators exercise in integrating complex datasets into databases and adding high value metadata that enable downstream analytics and facilitate practical applications. (d) Implement standardized software and database development practices to best leverage limited developer time and expertise gained from the various legume (and other) species. (e) Develop tools and databases that can manage genetic information for the world's plant genetic resources, enabling efficient incorporation of important traits into breeding programs. (f) Centralize information on databases, tools, and training materials and establish funding streams to support training and outreach.
